{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Продлит ли клиент подписку (Ахметов Данил Маратович)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Описание набора данных"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В задаче необходимо предсказать, продлит пользователь сервиса подписку или нет. Задача бинарной классификации."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Список переменных **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1 - id\tидентификатор.  \n",
    "2 - taxactionSystem\t система налогообложения (категориальный).  \n",
    "3 - regdt  Дата регистрации (число).  \n",
    "4 - workerCount   количество сотрудников (число).  \n",
    "5 - fssdccount   количество отправленных отчетов в ФСС из этой организации из БК за все время существования (число).  \n",
    "6 - pfrdcCount   количество отправленных отчетов в ПФР из этой организации из БК за все время существования (число).  \n",
    "7 - fnsdcCount   количество отправленных отчетов в ФНС из этой организации из БК за все время существования (число).  \n",
    "8 - hasCloudCryptCertificate   был ли когда-либо в этой организации выпущен облачный сертификат (бинарный).  \n",
    "9 - OrgCreationDate  дата добавления организации в БК Это не дата регистрации организации в ФНС и т.д., это дата, когда организация была добавлена (создана) в БК (число).  \n",
    "10 - documentsCount  количество документов. Считает количество документов в системе (которые показываются на вкладке \"\"Все\"\") в этом количестве учитываются не все документы ) (число).  \n",
    "11 - cnt_users   количество пользователей (число).  \n",
    "\n",
    "Целевая переменная 12 - is_prolong  - продлится пользователь или нет. (бинарная: 1, 0)\n",
    "\n",
    "Формат файла ответов:  \n",
    "id, is_prolong  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Импортируем бииблиотеки**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*- \n",
    "import pandas as pd\n",
    "import math\n",
    "import re\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import classification_report, accuracy_score,roc_auc_score, roc_curve, confusion_matrix\n",
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder,StandardScaler,LabelBinarizer\n",
    "from sklearn.datasets import fetch_20newsgroups, load_files\n",
    "from sklearn.model_selection import GridSearchCV,StratifiedKFold\n",
    "from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier\n",
    "from sklearn.metrics import roc_auc_score,precision_recall_curve,roc_curve\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.learning_curve import validation_curve,learning_curve\n",
    "\n",
    "import xgboost as xgb\n",
    "from evolutionary_search import EvolutionaryAlgorithmSearchCV\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "#import sys  \n",
    "#reload(sys)  \n",
    "#sys.setdefaultencoding('utf-8')\n",
    "\n",
    "# настройка внешнего вида графиков в seaborn\n",
    "sns.set_style(\"dark\")\n",
    "sns.set_palette(\"RdBu\")\n",
    "sns.set_context(\"notebook\", font_scale = 1.5, \n",
    "                rc = { \"figure.figsize\" : (15, 5), \"axes.titlesize\" : 18 })\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Загрузим данные**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train=pd.read_csv('../../data/prolongation_service_train.csv',sep='\\t',encoding='cp1251',parse_dates=['regdt','OrgCreationDate'])\n",
    "test=pd.read_csv('../../data/prolongation_service_test.csv',sep='\\t',encoding='cp1251',parse_dates=['regdt','OrgCreationDate'])\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Выведем основные харакетристики переменных:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(train.shape)\n",
    "train.describe(include = \"all\").T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Визуализируем данные**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Распределение целевой переменной"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print 'Распределение целевой переменной: \\t'\n",
    "sns.countplot(train.is_prolong)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Имеется небольшой дисбаланс классов, избавимся от него дальше."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Корреляция"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.heatmap(train.corr(method='pearson'),xticklabels=True,yticklabels=True);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.heatmap(train.corr(method='spearman'),xticklabels=True,yticklabels=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Корреляция признаков \"count\" обусловлена тем, что чем чаще клиент пользуется сервисом и отправляет отчетность, тем больше он захочет продлить сервис. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Предобработка данных"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.concat([train,test],ignore_index=True)\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Удалим столбец id, так как он не несет никакой информации\n",
    "df.drop(columns=['id'],axis=1,inplace=True)\n",
    "df.fillna(value=0,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Заменим строчки с кривой датой\n",
    "#Дата 0001-01-01 обусловлена default value в sql server, т.е при вставке данных в таблицу не была указана дата \n",
    "df['regdt'] = df['regdt'].replace(['0001-01-01 00:00:00.0000000'],u'2013-07-16') # 2013-07-16 медиана\n",
    "#приведем дату в нужный формат\n",
    "df[['regdt','OrgCreationDate']]=df[['regdt','OrgCreationDate']].apply(pd.to_datetime)\n",
    "df['regdt'] = df['regdt'].fillna(value=pd._libs.tslib.Timestamp('2013-07-16 00:00:00'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Приведем колонки к формату int\n",
    "columns = df.select_dtypes(['floating']).columns\n",
    "df[columns] = df[columns].astype('int64')\n",
    "\n",
    "#Работаем с датой, достанем год, месяц, день\n",
    "df['regdt_year']=pd.DatetimeIndex(df['regdt']).year\n",
    "df['regdt_month']=pd.DatetimeIndex(df['regdt']).month\n",
    "df['regdt_day']=pd.DatetimeIndex(df['regdt']).day\n",
    "\n",
    "df['OrgCreationDate_year']=pd.DatetimeIndex(df['OrgCreationDate']).year.astype('int64')\n",
    "df['OrgCreationDate_month']=pd.DatetimeIndex(df['OrgCreationDate']).month.astype('int64')\n",
    "df['OrgCreationDate_day']=pd.DatetimeIndex(df['OrgCreationDate']).day.astype('int64')\n",
    "\n",
    "df['delta_year']=df['regdt_year']-df['OrgCreationDate_year']\n",
    "\n",
    "#Попробуем добавить кол-во лет с момента регистрации\n",
    "df['today-regdt_year']=2018-df['regdt_year']\n",
    "df['today-OrgCreationDate_year']=2018-df['OrgCreationDate_year']\n",
    "\n",
    "df.drop(axis=1,columns=['regdt','OrgCreationDate'],inplace=True)\n",
    "\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Достанем из признака taxactionSystem систему налогооблажения и величину ставки**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_tax=pd.DataFrame(columns=['tax','stavka'])\n",
    "for idx,i in enumerate(df.taxactionSystem):\n",
    "    temp=i.split(', ')\n",
    "    if len(temp)!=2:\n",
    "        new_tax.loc[idx]=[temp[0],0]\n",
    "    else:\n",
    "        new_tax.loc[idx]=[temp[0],int(str(re.search(r'\\d+%', temp[1]).group(0))[:-1])]\n",
    "        \n",
    "df['tax']=new_tax['tax']\n",
    "df['stavka']=new_tax['stavka']\n",
    "df.drop(columns='taxactionSystem',axis=1,inplace=True)\n",
    "\n",
    "#train.dropna(axis=0, how='any',inplace=True)\n",
    "df['stavka'] = df['stavka'].astype('int64')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Закодируем категориальные признаки с помощью One hot encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_encoder = LabelEncoder()\n",
    "\n",
    "mapped_education = pd.Series(label_encoder.fit_transform(np.hstack((df['tax']))))\n",
    "mapped_education.value_counts().plot.barh()\n",
    "\n",
    "# integer encode\n",
    "integer_encoded = label_encoder.fit_transform(df['tax']).astype('int')\n",
    "\n",
    "onehot_encoder = OneHotEncoder(sparse=False)\n",
    "integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)\n",
    "onehot_encoded = onehot_encoder.fit_transform(integer_encoded)\n",
    "print(onehot_encoded[:5])\n",
    "\n",
    "# invert first example\n",
    "inverted = label_encoder.inverse_transform([np.argmax(onehot_encoded[0, :])])\n",
    "\n",
    "df=df.join(pd.DataFrame(onehot_encoded,columns=[\"tax\"+str(i) for i in range(onehot_encoded.shape[1])]),how='outer')\n",
    "df.drop(columns='tax',axis=1,inplace=True)\n",
    "#train.dropna(axis=0, how='any',inplace=True)\n",
    "df[df.columns] = df[df.columns].astype('int64')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print len(train), len(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = df[:][:len(train)]\n",
    "test = df[:][len(train):]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Разделим выборку на обучающую и тестовую**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target=train.is_prolong\n",
    "train.drop(axis=1,columns=['is_prolong'],inplace=True)\n",
    "train_X,test_X,train_y,test_y=train_test_split(train,target,shuffle=True,random_state=17,test_size=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Выровняем классы с помощью SMOTE**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from imblearn.over_sampling import SMOTE\n",
    "smote = SMOTE(kind='borderline1', random_state=17)\n",
    "train_X, train_y = smote.fit_sample(train_X, train_y)\n",
    "train_X=pd.DataFrame(train_X,columns=train.columns)\n",
    "\n",
    "sns.countplot(train_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Обучение моделей\n",
    "## Логистическая регрессия"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В качестве метрики будем использовать ROC_AUC, так как имеем дело с несбалансированной выборкой, и порог, равный 0.5, может оказывается не оптимальным "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Напишем функцию для получения метрики roc-auc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getROC_AUC(method,test_X,test_y):\n",
    "    precision, recall, thresholds = precision_recall_curve(test_y, method.predict_proba(test_X)[:,1])\n",
    "    plt.title(\"precision-recall\")\n",
    "    plt.plot(recall,precision)\n",
    "\n",
    "    fpr, tpr, thresholds = roc_curve(test_y, method.predict_proba(test_X)[:,1])\n",
    "    plt.figure()\n",
    "    plt.title(\"ROC-AUC\")\n",
    "    plt.plot(fpr,tpr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Попробуем применить LR с подбором параметров с помощью EvolutionaryAlgorithmSearchCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr=LogisticRegression(n_jobs=-1)\n",
    "lr_pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])\n",
    "\n",
    "paramgrid = {\"lr__C\": np.logspace(-8, 1, 10),\n",
    "             \"lr__penalty\"     : ['l1','l2']}\n",
    "\n",
    "np.random.seed(17)\n",
    "\n",
    "cv = EvolutionaryAlgorithmSearchCV(estimator=lr_pipe,\n",
    "                                   params=paramgrid,\n",
    "                                   scoring=\"roc_auc\",\n",
    "                                   cv=StratifiedKFold(n_splits=4),\n",
    "                                   verbose=1,\n",
    "                                   population_size=10,\n",
    "                                   gene_mutation_prob=0.10,\n",
    "                                   gene_crossover_prob=0.5,\n",
    "                                   tournament_size=3,\n",
    "                                   generations_number=10,\n",
    "                                   n_jobs=4)\n",
    "cv.fit(train_X, train_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Построим модель с лучшими параметрами, и посмотрим качество на отложенной выборке"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = LogisticRegression(n_jobs=-1,penalty='l1',C=1)\n",
    "scaler_lr=StandardScaler()\n",
    "\n",
    "train_X_scale=scaler_lr.fit_transform(train_X)\n",
    "test_X_scale=scaler_lr.transform(test_X)\n",
    "\n",
    "lr.fit(train_X_scale, train_y)\n",
    "lr_pred=lr.predict(test_X_scale)\n",
    "print accuracy_score(test_y, lr_pred)\n",
    "print roc_auc_score(test_y, lr.predict_proba(test_X_scale)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Оценим важность признаков**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Лассо обнуляет веса ненужных признаков"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = pd.DataFrame(lr.coef_.reshape((20,1)),\n",
    "                        index=train_X.columns, \n",
    "                        columns=['Importance']).sort_values(['Importance'], ascending=False)\n",
    "features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Видим, что наибольший вклад внёс признак hasCloudCryptCertificate. Компании, у которых имеется электронаая подпись, чаще продливают подписку, это подтверждено практикой."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Случайный лес"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Сделаем тоже самое для Случайного леса"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnd_forest=RandomForestClassifier(n_jobs=-1,random_state=17,oob_score=False)\n",
    "\n",
    "forest_params = {'max_depth': [5,6,7,8,9,10]\n",
    "                ,'max_features': [1,2,3,4,5,6,7]\n",
    "                ,'min_samples_leaf':[1,2,3,5,7,8]\n",
    "                ,'n_estimators':[70,80,90,100,110,125,135,150]\n",
    "                }\n",
    "\n",
    "np.random.seed(17)\n",
    "\n",
    "cv = EvolutionaryAlgorithmSearchCV(estimator=rnd_forest,\n",
    "                                   params=forest_params,\n",
    "                                   scoring=\"roc_auc\",\n",
    "                                   cv=5,\n",
    "                                   verbose=1,\n",
    "                                   population_size=100,\n",
    "                                   gene_mutation_prob=0.10,\n",
    "                                   gene_crossover_prob=0.5,\n",
    "                                   tournament_size=3,\n",
    "                                   generations_number=10,\n",
    "                                   n_jobs=8)\n",
    "cv.fit(train_X, train_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Построим лучшую модель**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnd_forest=RandomForestClassifier(n_jobs=-1,n_estimators=150,max_depth=10,max_features=5,min_samples_leaf=1,random_state=17)\n",
    "rnd_forest.fit(train_X,train_y)\n",
    "pred_forest=rnd_forest.predict(test_X)\n",
    "print accuracy_score(test_y,pred_forest)\n",
    "print roc_auc_score(test_y, rnd_forest.predict_proba(test_X)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Наиболее важные признаки"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = pd.DataFrame(rnd_forest.feature_importances_,\n",
    "                        index=train_X.columns, \n",
    "                        columns=['Importance']).sort_values(['Importance'], ascending=False)\n",
    "features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Случайный лес, также в качестве наиболее важного признака выбрал hasCloudCryptCertificate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**График \"важности\" признаков**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(range(len(features.Importance.tolist())), \n",
    "         features.Importance.tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## График кривой обучения"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_with_std(x, data, **kwargs):\n",
    "        mu, std = data.mean(1), data.std(1)\n",
    "        lines = plt.plot(x, mu, '-', **kwargs)\n",
    "        plt.fill_between(x, mu - std, mu + std, edgecolor='none',\n",
    "                         facecolor=lines[0].get_color(), alpha=0.2)\n",
    "        \n",
    "def plot_learning_curve(clf, X, y, scoring, cv=5):\n",
    " \n",
    "    train_sizes = np.linspace(0.05, 1, 20)\n",
    "    n_train, val_train, val_test = learning_curve(clf,\n",
    "                                                  X, y, train_sizes, cv=cv,\n",
    "                                                  scoring=scoring)\n",
    "    plot_with_std(n_train, val_train, label='training scores', c='green')\n",
    "    plot_with_std(n_train, val_test, label='validation scores', c='red')\n",
    "    plt.xlabel('Training Set Size'); plt.ylabel(scoring)\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_learning_curve(RandomForestClassifier(n_jobs=-1,n_estimators=150,\n",
    "                                           max_depth=10,max_features=5,\n",
    "                                           min_samples_leaf=1,random_state=17),\n",
    "                   train_X, train_y, scoring=None, cv=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Построим валидационную кривую для данных параметров леса. В качестве параметра сложности будем использовать max_depth:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_validation_curve(clf, X, y, cv_param_name, \n",
    "                          cv_param_values, scoring):\n",
    "\n",
    "    val_train, val_test = validation_curve(clf, X, y, cv_param_name,\n",
    "                                           cv_param_values, cv=5,\n",
    "                                                  scoring=scoring)\n",
    "    plot_with_std(cv_param_values, val_train, \n",
    "                  label='training scores', c='green')\n",
    "    plot_with_std(cv_param_values, val_test, \n",
    "                  label='validation scores', c='red')\n",
    "    plt.xlabel(cv_param_name); plt.ylabel(scoring)\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_depth = range(1,10)\n",
    "plot_validation_curve(RandomForestClassifier(n_jobs=-1,n_estimators=150,\n",
    "                                           max_depth=10,max_features=5,\n",
    "                                           min_samples_leaf=1,random_state=17), train_X, train_y, \n",
    "                      cv_param_name='max_depth', \n",
    "                      cv_param_values=max_depth,\n",
    "                      scoring=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Найдём лучшие параметры для ещё нескольких алгоритмов и попробуем EnsembleClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Градиентрный бустинг"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Попробуем градиентный бустинг над деревьями"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g_boost=GradientBoostingClassifier(random_state=17,learning_rate=0.6)\n",
    "boost_params = {'max_depth': range(1,10)\n",
    "                ,'max_features': range(1,13)\n",
    "                ,'min_samples_leaf':range(1,5)\n",
    "                ,'n_estimators':range(10,150,10)\n",
    "                ,'learning_rate':np.arange(0.01,1,0.05)\n",
    "                }\n",
    "np.random.seed(17)\n",
    "cv = EvolutionaryAlgorithmSearchCV(estimator=g_boost,\n",
    "                                   params=boost_params,\n",
    "                                   scoring=\"roc_auc\",\n",
    "                                   cv=5,\n",
    "                                   verbose=1,\n",
    "                                   population_size=50,\n",
    "                                   gene_mutation_prob=0.10,\n",
    "                                   gene_crossover_prob=0.5,\n",
    "                                   tournament_size=3,\n",
    "                                   generations_number=10,\n",
    "                                   n_jobs=8)\n",
    "cv.fit(X_train, train_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Построим лучшую модель**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g_boost=GradientBoostingClassifier(random_state=17,n_estimators=120,max_depth=4,learning_rate=0.16,max_features=5,min_samples_leaf=4)\n",
    "g_boost.fit(train_X,train_y)\n",
    "pred_boost=g_boost.predict(test_X)\n",
    "print accuracy_score(test_y,pred_boost)\n",
    "print roc_auc_score(test_y, g_boost.predict_proba(test_X)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## xgboost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# x_boost=xgb.XGBClassifier()\n",
    "# X_boost_params = {'max_depth': range(1,10)\n",
    "#                 ,'n_estimators':range(10,150,10)\n",
    "#                 ,'learning_rate':np.arange(0.01,0.2,0.05)\n",
    "#                 }\n",
    "\n",
    "# np.random.seed(17)\n",
    "# cv = EvolutionaryAlgorithmSearchCV(estimator=x_boost,\n",
    "#                                    params=X_boost_params,\n",
    "#                                    scoring=\"accuracy\",\n",
    "#                                    cv=5,\n",
    "#                                    verbose=1,\n",
    "#                                    population_size=50,\n",
    "#                                    gene_mutation_prob=0.10,\n",
    "#                                    gene_crossover_prob=0.5,\n",
    "#                                    tournament_size=3,\n",
    "#                                    generations_number=10,\n",
    "#                                    n_jobs=8)\n",
    "# cv.fit(X_train, train_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Построим лучшую модель**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# x_boost=xgb.XGBClassifier(n_estimators=70,learning_rate=0.06,max_depth=4)\n",
    "# x_boost.fit(X_train,train_y)\n",
    "# pred_x_boost=x_boost.predict(X_test)\n",
    "# print accuracy_score(test_y,pred_x_boost)\n",
    "# print roc_auc_score(test_y, x_boost.predict_proba(X_test)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Метод ближайших соседей"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1,weights='distance',n_neighbors=14))])\n",
    "n_neighbors=range(2, 15)\n",
    "knn_params = {'knn__n_neighbors': n_neighbors}\n",
    "knn_grid = GridSearchCV(knn_pipe, knn_params,\n",
    "                         cv=5, n_jobs=-1,\n",
    "                        verbose=True)\n",
    "knn_grid.fit(train_X, train_y)\n",
    "print (knn_grid.best_params_, knn_grid.best_score_)\n",
    "print accuracy_score(test_y, knn_grid.predict(test_X))\n",
    "print roc_auc_score(test_y, knn_grid.predict_proba(test_X)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EnsembleClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from brew.base import Ensemble, EnsembleClassifier\n",
    "from brew.stacking.stacker import EnsembleStack, EnsembleStackClassifier\n",
    "from brew.combination.combiner import Combiner\n",
    "from sklearn import clone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initializing Classifiers\n",
    "clf1 = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1,weights='distance',n_neighbors=14))])\n",
    "clf2 = RandomForestClassifier(n_jobs=-1,n_estimators=150,max_depth=10,max_features=5,min_samples_leaf=1,random_state=17)\n",
    "clf3 = GradientBoostingClassifier(random_state=17,n_estimators=110,max_depth=7,learning_rate=0.11,max_features=3,min_samples_leaf=3)\n",
    "clf4 = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(n_jobs=-1,penalty='l2',C=10,random_state=17))])\n",
    "clf5 = RandomForestClassifier(n_jobs=-1,n_estimators=100,max_depth=10,random_state=17)\n",
    "\n",
    "# Creating Ensemble\n",
    "ensemble = Ensemble([clf1, clf2, clf3, clf4, clf5])\n",
    "eclf = EnsembleClassifier(ensemble=ensemble, combiner=Combiner('mean'))\n",
    "\n",
    "# Creating Stacking\n",
    "layer_1 = Ensemble([clf1, clf2, clf3, clf4])\n",
    "layer_2 = Ensemble([clone(clf5)])\n",
    "\n",
    "stack = EnsembleStack(cv=3)\n",
    "\n",
    "stack.add_layer(layer_1)\n",
    "stack.add_layer(layer_2)\n",
    "\n",
    "sclf = EnsembleStackClassifier(stack)\n",
    "\n",
    "\n",
    "# Loading some example data\n",
    "X = X_train\n",
    "y = train_y\n",
    "\n",
    "d = {yi : i for i, yi in enumerate(set(y))}\n",
    "y = np.array([d[yi] for yi in y])\n",
    "\n",
    "sclf.fit(X,y)\n",
    "print 'sclf ' + str(roc_auc_score(test_y,sclf.predict_proba(X_test)[:,-1]))\n",
    "eclf.fit(X,y)\n",
    "print 'eclf ' + str(roc_auc_score(test_y,eclf.predict_proba(X_test)[:,-1]))\n",
    "getROC_AUC(sclf,X_test,test_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Результаты ну прям не очень, попробуем что-нибудь по-серьезнее:)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Разделим обучающую выборку пополам, на первой половине обучим классификаторы, на второй ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_X,val_X,train_y,val_y=train_test_split(train,target,shuffle=True,random_state=17,test_size=0.2)\n",
    "train_X1,train_X2,train_y1,train_y2=train_test_split(train_X,train_y,random_state=17,test_size=0.7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.DataFrame()\n",
    "    \n",
    "lr=LogisticRegression(random_state=17)\n",
    "lr = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(n_jobs=-1,penalty='l2',C=10,random_state=17))])\n",
    "lr.fit(train_X1,train_y1)\n",
    "data['lr'] = lr.predict_proba(train_X2)[:,-1]\n",
    "    \n",
    "random_forest = RandomForestClassifier(n_jobs=-1,n_estimators=70,max_depth=10,max_features=4,min_samples_leaf=3,random_state=17)\n",
    "random_forest.fit(train_X1,train_y1)\n",
    "data['random_forest'] = random_forest.predict_proba(train_X2)[:,-1]\n",
    "    \n",
    "gradient_boosting = GradientBoostingClassifier(random_state=17,n_estimators=140,max_depth=5,learning_rate=0.06,max_features=6,min_samples_leaf=4)\n",
    "gradient_boosting.fit(train_X1,train_y1)\n",
    "data['gradient_boosting'] = gradient_boosting.predict_proba(train_X2)[:,-1]\n",
    "    \n",
    "x_boost=xgb.XGBClassifier(n_estimators=70,learning_rate=0.06,max_depth=4)\n",
    "x_boost.fit(train_X1,train_y1)\n",
    "data['x_boost']=x_boost.predict_proba(train_X2)[:,-1]\n",
    "    \n",
    "kneighbors = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1,weights='distance',n_neighbors=14))])\n",
    "kneighbors.fit(train_X1,train_y1)\n",
    "data['kneighbors']=kneighbors.predict_proba(train_X2)[:,-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Функция, возвращающая предсказанные данные:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_predictions(models,train_X2):\n",
    "    data = pd.DataFrame()\n",
    "    for idx, i in enumerate(models):\n",
    "        data[str(idx)] = i.predict_proba(train_X2)[:,-1]\n",
    "    return data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_X2 = get_predictions([lr,random_forest,gradient_boosting,x_boost,kneighbors], train_X2)\n",
    "val_X = get_predictions([lr,random_forest,gradient_boosting,x_boost,kneighbors], val_X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Обучим простую логистическую регрессию:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr2 = LogisticRegression(C=0.5)\n",
    "lr2.fit(train_X2,train_y2)\n",
    "print accuracy_score(val_y,lr2.predict(val_X))\n",
    "print roc_auc_score(val_y,lr2.predict_proba(val_X)[:,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Результат тоже не впечатляет."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Наибольшее значение ROC_AUC у случайного леса, поэтому применим его для формирования итогового прогноза."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnd_forest=RandomForestClassifier(n_jobs=-1,n_estimators=150,max_depth=10,max_features=5,min_samples_leaf=1,random_state=17)\n",
    "rnd_forest.fit(train,target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test.drop(axis=1,columns='is_prolong', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(np.stack((range(len(test)),rnd_forest.predict(test)), axis=1),columns=['id','is_prolong']).to_csv('submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_csv('submission.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "К сожалению, оценить наш прогноз не получится, так как мне ещё не предоставили ответы. Будем надеяться, что оценка на тестовой выборке, будет близка к оценке на кросс валидации."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
